Canonical Huffman

Introduction

This article assumes knowledge of at least static huffman. It will teach how to gain a little bit of bpb. Also, it will lead to less memory requirements and faster decoding.

Canonical Huffman

Canonical huffman defines a set of rules and then based on the lengths of the codes and the symbols it makes the codes. So you don't need to save the codes along with the bitstream anymore. What are these rules by which can make the codes? They are the following:

- Codes with shorter code length have higher values than longer ones.

- Codes with the same length increase the code value as the symbol value increases.

Those rules lead to an efficient decoding and also coding because no more ceil(log2(alphabetsize)) bits may be 1. This formula means the following: ceil() rounds to an int, log2 means (Log base10 (x)) / (Log base10 (2)), and alphabetsize is the number of symbols which you'll code.

Pseudo code

The first thing we need are symbols and code lengths computed via huffman, then you can start to do the codes. The pseudo code to do that isn't that difficult. Look:

   - First  compute the first code for the maximum length, it is 'max lenght'
     bits long and its value is all 0s.
   - Set last_#_codes to the number of codes that the maximum length has.
   - Repeat from Maximum length to 0.
     - code = code + last_#_codes
     - code >>= 1
     - current length start_code = code
     - last_#_codes = number of codes in current length

Sort all the lengths and symbols alphabetically. Then go through the codelength and assign to the current code the start code value of that length, further increment that value and repeat. (Note that by incrementing the code value, you meet the second rule.)

Want an example? Let's say we have a file which huffman codes and the code lengths are the following:

                          Symbol    Length    Code
                            a         2       11
                            b         2       00
                            c         3       011
                            d         3       101
                            e         4       0100
                            f         4       0101
                            i         4       1000
                            g         5       10010
                            h         5       10011

Note that we'll not need the codes, so we don't need to compute them, only the code lengths. Also note that in the following process we don't need to keep track of the actual code length. (I assume that you have somewhere else in memory an array with the symbols and their lengths, and also you've probably reserved space for the codes.)

Before starting this loop we have to set up an array which contains the number of codes which have a given length. I don't think you need to learn how this is done. For our example it will be:

   #codes[16]={0,2,2,3,2,0,0,0,0,0,0,0}

We have 16 entries because we assume our codes to be 16 bits long. Let's start doing the start_codes:

   - count = maximum code                                 5
   - code = 0
   - start_code[count] = code
   - last_#_codes = #codes[count]                         #codes[5]=2
   - --count
   - Loop
     - code + last_#_codes = 0010                         0+2=2
     - code >> 1 = 0001
     - start_code[count] = code                           start_code[4]=0001
     - last_#_codes = number of codes in current length   last_#_codes=3
   - Repeat
     - code + last_#_codes = 100                          1+3=4
     - code >> 1 = 010
     - start_code[count] = code                           start_code[3]=010
     - last_#_codes = number of codes in current length   last_#_codes=2
   - Repeat
     - code + last_#_codes = 100                          2+2=4
     - code >> 1 = 10
     - start_code[count] = code                           start_code[2]=10
     - last_#_codes = number of codes in current length   last_#_codes=2
   - Repeat
     - code + last_#_codes = 100                          2+2=4
     - code >> 1 = 10
     - start_code[count] = code                           start_code[1]=010
     - last_#_codes = number of codes in current length   last_#_codes=0
   - Stop

Remember that we don't need to keep track of the length of the current code. In this case we don't care about what the start code for the length 1 is. We'll not use it. Now we have those start codes:

- start_code[5]=00000
- start_code[4]=0001
- start_code[3]=010
- start_code[2]=10
- start_code[1]=0

And those are the symbols and lengths:

                                Symbol    Length
                                  a         2
                                  b         2
                                  c         3
                                  d         3
                                  e         4
                                  f         4
                                  i         4
                                  g         5
                                  h         5

Then we sort the symbols and lengths, visit them in alphabetical order, and start to assign codes:

               Symbol    Length    Start code    Code    Incremented
                 a         2       10            10      11
                 b         2       11            11      00
                 c         3       010           010     011
                 d         3       011           l011    100
                 e         4       0001          0001    0010
                 f         4       0010          0l010   0011
                 g         5       00000         00000   00001
                 h         5       00001         00001   l00010
                 i         4       0011          0011    0100

And those codes are instantaneous ones, so it is no problem to decode them. Also both the first rule and the second are met, the second explicity, and for the first, look:

               Symbol    Length    Start code    Code    Value
                 a         2       10            10      16
                 b         2       11            11      24
                 c         3       010           010     8
                 d         3       011           l011    12
                 e         4       0001          0001    2
                 f         4       0010          0l010   4
                 g         5       00000         00000   0
                 h         5       00001         00001   1
                 i         4       0011          0011    6

Encoder memory usage

Remember the first rule, and what it ensured: no more than ceil(log2(alphabetsize)) bits may be 1. If we are encoding an alphabetsize with 256 symbols (probably literals) then, no more than 8 bits can be 1. So if our codes are 16 bits long then the msb will be 00000000XXXXXXXX where X means it can be 0 or 1. So when saving the codes in memory we just need to save the low part of it, because we know that the high part will be always 0. So when we have to write a code, we check if it's above 8. If it is, then we write the low part, and for the high just zeroes. We can even make make a special putbits routine which only shifts but doesn't OR (because we only write 0). Before that our data structures looked something like that:

- Byte, length of the code
- Word, code

Thus making it not fit in word boundaries, so we usually had to make it fit in dword boundaries with something like that:

- Byte, length of the code
- Byte, dummy
- Word, code

Accessing it was just a matter of a shift left 2. But now we know that we only need to store in memory the low part, so we can save it in that way:

- Byte, length
- Byte, low part of code

Thus making it fit in word boundaries and making it easy to access, also we could load it at once, and using less memory reduces the cache misses.

Saving the header

Well, this is the advantage of canonical huffman: We don't need to store the codes. Of course we still need to pass to the decoder the lengths and symbols. It can be done in the following way: First there's 16 bytes which mean the symbols that a given code length has. After that there's the symbols starting with those for the code length 1, then 2,... in alphabetical order. It's quite clear how to decode this, isn't it? Of course once you have the lengths and symbols you do again the codes and you can start to decode.

Closing words

The fact that long codes are filled with zeroes may lead to fast decoding. Look at http://www.compressconsult.com/huffman/.

If you find any error in this article or want something to improve email me.

Dario Phong, Barcelona 15-May-1999